Use vectorcall for all-positional-argument calls #5896
base: master
Conversation
If a handle or object is called with only positional arguments, it is
straightforward to use PyObject_Vectorcall instead of
PyObject_CallObject.
Benchmarked by adding a trivial function to pybind11_benchmark:
```cpp
m.def("call_func_with_int", [](py::object func) {
    return func(py::cast(1));
});
```
and then running `python -m timeit --setup 'from pybind11_benchmark import call_func_with_int; f = lambda x: x + 1' 'call_func_with_int(f)'`.
Before on M4 mac: 57.6 nsec per loop
After on M4 mac: 48.4 nsec per loop
For comparison, the included collatz benchmark takes 33.1 nsec per
loop, just calling `f(1)` directly takes 17.8 nsec per loop, and simply
running `pass` takes 4.19 nsec per loop.
include/pybind11/cast.h
Outdated
```cpp
// Disable warnings about useless comparisons when N == 0.
PYBIND11_WARNING_PUSH
PYBIND11_WARNING_DISABLE_GCC("-Wtype-limits")
PYBIND11_WARNING_DISABLE_INTEL(186)
```
not sure why suppressing the icc warning didn't work :(
FWIW, I attempted to extend this to use vectorcall for the unpacking/kwargs/etc. cases, but my naive/straightforward attempt added too much cost to kwarg handling to actually improve performance. (The tuple-unpacking case in particular is bottlenecked on args_proxy having to use slow PyIter-based iteration instead of PyTuple_GET_ITEM as currently written, so we couldn't help there either, even though it doesn't use kwargs.)
```diff
 /// Helper class which collects only positional arguments for a Python function call.
 /// A fancier version below can collect any argument, but this one is optimal for simple calls.
-template <return_value_policy policy>
+template <size_t N, return_value_policy policy>
```
IIUC there are two downsides with this approach:
- Code size increases because there will be a template instantiation for each N.
- Code complexity increases significantly, making future maintenance more difficult.

The obvious question: is it really worth it for real-world situations?
```diff
 private:
-    tuple m_args;
+    std::array<PyObject *, N> m_args;
```
Could we get away from the N template instantiations by having a maximum N (capacity) and runtime num_args?
```cpp
/// Collect only positional arguments for a Python function call
template <return_value_policy policy,
          typename... Args,
          typename = enable_if_t<args_are_all_positional<Args...>()>>
```
Connected to the capacity idea above: enable_if only if <= capacity, is that feasible?
I think we don't have to be stingy with the capacity, e.g. 128 (pointers) would seem totally fine, so in practice this would hardly ever go to the fallback.